feat(spark): implement substring function by cht42 · Pull Request #19805 · apache/datafusion

cht42 · 2026-01-14T10:28:12Z

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Implementation of spark substring function.

Spark implementation:

Are these changes tested?

yes

Are there any user-facing changes?

cht42 · 2026-01-14T10:29:15Z

datafusion/spark/benches/substring.rs

@@ -0,0 +1,207 @@
+// Licensed to the Apache Software Foundation (ASF) under one


copied the same bench as https://github.com/apache/datafusion/blob/main/datafusion/functions/src/unicode/substr.rs

Here is a comparison of the perf between the DF built-in and the new spark substring expressions. The perf is comparable

Benchmark Spark substring Time DF substring Time

SHORTER THAN 12/substr_string_view [size=1024, strlen=12] [9.9232 µs 10.026 µs 10.152 µs] [7.2600 µs 7.2906 µs 7.3209 µs]

SHORTER THAN 12/substr_string [size=1024, strlen=12] [7.6418 µs 7.7211 µs 7.8093 µs] [8.4374 µs 9.0174 µs 10.083 µs]

SHORTER THAN 12/substr_large_string [size=1024, strlen=12] [8.1936 µs 8.2795 µs 8.3950 µs] [8.8816 µs 8.9169 µs 8.9539 µs]

LONGER THAN 12/substr_string_view [size=1024, count=64, strlen=128] [14.973 µs 15.169 µs 15.429 µs] [12.647 µs 12.770 µs 12.904 µs]

LONGER THAN 12/substr_string [size=1024, count=64, strlen=128] [12.132 µs 12.183 µs 12.233 µs] [13.968 µs 14.022 µs 14.074 µs]

LONGER THAN 12/substr_large_string [size=1024, count=64, strlen=128] [12.259 µs 12.317 µs 12.379 µs] [13.995 µs 14.109 µs 14.280 µs]

SRC_LEN > 12, SUB_LEN < 12/substr_string_view [size=1024, count=6, strlen=128] [22.497 µs 22.631 µs 22.793 µs] [14.784 µs 14.901 µs 15.075 µs]

SRC_LEN > 12, SUB_LEN < 12/substr_string [size=1024, count=6, strlen=128] [20.755 µs 20.850 µs 20.978 µs] [15.804 µs 15.861 µs 15.914 µs]

SRC_LEN > 12, SUB_LEN < 12/substr_large_string [size=1024, count=6, strlen=128] [20.794 µs 21.045 µs 21.459 µs] [15.771 µs 15.816 µs 15.860 µs]

SHORTER THAN 12/substr_string_view [size=4096, strlen=12] [38.334 µs 38.569 µs 38.934 µs] [26.812 µs 27.032 µs 27.307 µs]

SHORTER THAN 12/substr_string [size=4096, strlen=12] [27.456 µs 27.548 µs 27.640 µs] [32.104 µs 32.449 µs 32.802 µs]

SHORTER THAN 12/substr_large_string [size=4096, strlen=12] [30.323 µs 30.777 µs 31.286 µs] [34.284 µs 34.514 µs 34.858 µs]

LONGER THAN 12/substr_string_view [size=4096, count=64, strlen=128] [55.663 µs 56.315 µs 57.106 µs] [47.091 µs 47.321 µs 47.639 µs]

LONGER THAN 12/substr_string [size=4096, count=64, strlen=128] [40.263 µs 40.768 µs 41.399 µs] [51.885 µs 52.049 µs 52.253 µs]

LONGER THAN 12/substr_large_string [size=4096, count=64, strlen=128] [43.083 µs 43.261 µs 43.452 µs] [51.839 µs 52.048 µs 52.348 µs]

SRC_LEN > 12, SUB_LEN < 12/substr_string_view [size=4096, count=6, strlen=128] [84.649 µs 85.064 µs 85.500 µs] [55.237 µs 55.607 µs 56.018 µs]

SRC_LEN > 12, SUB_LEN < 12/substr_string [size=4096, count=6, strlen=128] [76.146 µs 77.393 µs 79.422 µs] [60.728 µs 60.998 µs 61.235 µs]

SRC_LEN > 12, SUB_LEN < 12/substr_large_string [size=4096, count=6, strlen=128] [76.974 µs 77.166 µs 77.365 µs] [60.319 µs 60.630 µs 60.976 µs]

Jefffrey · 2026-01-15T04:12:18Z

datafusion/spark/src/function/string/mod.rs

+make_udf_function!(substring::SparkSubstring, substring);
+make_udf_function!(substring::SparkSubstring, substr);


Why do we add two here?

thats when I was debugging why the SLT tests kept falling back to the other implementation 😅 when I realized that I needed the custom expr planner it fixed things. will remove the dup